feat: new schema based checker by pm100 · Pull Request #761 · dotenv-linter/dotenv-linter

Conversation

@pm100
Contributor

@pm100 pm100 commented Mar 5, 2024

I do not expect this to be accepted as is; I am sure there will be feedback.

There are a couple of big changes to the structure of the code.

First, I have unified the 3 different Options structs (Cli, Fix, Compare) into one CliOptions. Why? Because it seems useful to pass the options to the checkers, rather than pull out one field (skips). But fix accepts options and then calls both check and fix, so either I had to synthesize the check options from the fix options, or just make it one options type.

Next, I have added a method to the Check trait that allows checkers to be called after all lines have been processed, which is useful for detecting missing things. Check::run loops over the checkers at the end, calling this new entry point; a minimal sketch is below.
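
A minimal sketch of the shape this takes (the run signature here is an assumption for illustration; end is the new entry point, with a default body so existing checkers are unaffected):

pub trait Check {
    // Called once per line, as before (signature assumed for illustration).
    fn run(&mut self, line: &LineEntry) -> Option<Warning>;

    // New: called once after all lines have been processed, so a checker
    // can warn about things that are missing rather than present.
    fn end(&mut self) -> Vec<Warning> {
        Vec::new()
    }
}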

Features

  • specify which keys are allowed
  • specify the type: String, Number, Url, Email, Boolean
  • specify required or not
  • specify a regex for strings
  • specify whether keys other than those listed are allowed

sample schema

{
    "version": "1.0.0",
    "allow_other_keys": true,
    "entries": [
        {
            "key": "SECRET",
            "required": true,
            "type": "String",
            "regex": "^[a-z]*$"
        },
        {
            "key": "PORT",
            "atype": "Number"
        },
        {
            "key": "URL",
            "type": "Url"
        },
        {
            "key": "EMAIL",
            "type": "Email"
        }
    ]
}

To use it, pass -S | --schema on the command line.
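
For example (the file name .env.schema.json here is illustrative):

dotenv-linter --schema .env.schema.json
dotenv-linter -S .env.schema.json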

If the schema is not found or is invalid, an error message is printed and the app exits.

Not done yet

  • allowed set of values for a string or number
  • allowed range for numeric values

A full set of tests is included.

@codecov-commenter

codecov-commenter commented Mar 5, 2024

Codecov Report

Attention: Patch coverage is 98.85714%, with 4 lines in your changes missing coverage. Please review.

Project coverage is 99.11%. Comparing base (5b6de3a) to head (6dbed6a).
Report is 10 commits behind head on master.

Files Patch % Lines
dotenv-linter/src/schema.rs 98.29% 4 Missing ⚠️

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #761      +/-   ##
==========================================
- Coverage   99.15%   99.11%   -0.04%     
==========================================
  Files          46       48       +2     
  Lines        2834     3162     +328     
==========================================
+ Hits         2810     3134     +324     
- Misses         24       28       +4     

@pm100
Contributor Author

pm100 commented Mar 8, 2024

I have fixed the missing coverage, but the coverage CI test needs to be rerun.

Member

@DDtKey DDtKey left a comment

Thanks for the contribution!
I really like this to be part of dotenv-linter! 🚀

I've left some comments for a first pass.

@dotenv-linter/core please take a look

@DDtKey DDtKey changed the title new schema based checker feat: new schema based checker Mar 9, 2024
@DDtKey
Member

DDtKey commented Mar 9, 2024

Another idea, as a follow-up to this one, is to allow deriving the schema from .env files.

This would give users a way to generate a schema from their existing files - which can be large - and then edit it manually.

This is definitely not for this PR, but it highlights that it makes much more sense to extract the schema logic into a sub-crate; it could then be used even outside of dotenv-linter.

@pm100
Contributor Author

pm100 commented Mar 9, 2024

I have one failing test, because the text returned when a file fails to open is different on Windows (my main dev system) and Linux. I will try to fix it.

Your code coverage tool is a hard taskmaster - but very good at making sure everything is tested.

I think I have fixed it

Comment on lines 13 to 26
pub struct DotEnvSchema {
    pub version: String,
    pub allow_other_keys: bool,
    pub entries: HashMap<String, SchemaEntry>,
}

#[derive(Deserialize, Default)]
#[serde(default)]
pub struct SchemaEntry {
    pub key: String,
    pub required: bool,
    #[serde(rename = "type")]
    pub value_type: SchemaValueType,
    pub regex: Option<String>,
}
Member

@DDtKey DDtKey Mar 9, 2024

Since we convert the elements to a map anyway, what do you think of using enums?

It would also introduce some basic validation when reading the schema (of the schema itself).

I mean something like:

enum EnvVariableKind {
  String { regex },
  Email,
  Number { min, max },
  ..
}

So we will have a HashMap<String, EnvVariableKind> after validation of the schema.

Contributor Author

I wondered about this too; I don't think so, though. The only useful one at the moment is String(Option<Regex>). If the idea is to encapsulate the entire definition of the 'rule' in the enum, they get too complex. Imagine rules that say:

  • either KEY1 or KEY2 must be present but not both
  • if KEY1 present then KEY2 must also be present
  • these are the allowed values for KEY1 (could be a regex)
  • ...

The code then still needs to look at other things in the schema definition, not just the type enum, so my vote is no.

Member

@DDtKey DDtKey Mar 9, 2024

If we need type-independent checks, it could be a more extended wrapper, like:

// entries: HashMap<String, EnvValidationRules>

struct EnvValidationRules {
    ty: EnvVariableType, // contains the type-specific validations
    required: bool,
    excludes: Vec<String>, // keys which can't be present when this one is found (just an example)
    // ... other common rules which don't depend on the type
}

The main point here is that some configs are inappropriate for particular types (i.e. they depend only on the type).
For now it is only Regex for strings, but we already have an understanding of validations for numbers, and we can have more. UPD: see the boolean suggestion below, one more possible type-specific rule.

It's a good way to ensure the schema is passed in the correct form, and to rule out confusing definitions like this:

        {
            "key": "SECRET",
            "required": true,
            "type": "Number",
            "regex": "^[a-z]*$"
        },

Another benefit: the logic will be more separated. Type-specific checks can be just a method on the enum, so we can do something like ty.validate(value); a rough sketch follows below.

And in general this type can be deserialized with serde as is, so there is no need to map it manually.
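
A rough sketch of what ty.validate(value) could look like (the variant shapes and names here are illustrative, not the final API):

use regex::Regex;

enum EnvVariableType {
    String { regex: Option<Regex> },
    Integer,
}

impl EnvVariableType {
    // Returns an error message if the value fails the type-specific rule.
    fn validate(&self, value: &str) -> Option<String> {
        match self {
            Self::String { regex: Some(re) } if !re.is_match(value) => {
                Some(format!("value does not match regex {}", re))
            }
            Self::Integer if value.parse::<i64>().is_err() => {
                Some("value is not a valid integer".to_string())
            }
            _ => None,
        }
    }
}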

Comment on lines +129 to +133
Arg::new("schema")
.short('S')
.long("schema")
.help("Use schema file to check .env files")
.value_parser(value_parser!(PathBuf)),
Member

@DDtKey DDtKey Mar 9, 2024

Btw, it could be useful to have the ability to pass the schema as an in-place argument (a JSON string) to the CLI.

E.g. it could be used when you're piping commands.

But we don't have to do this initially; it can be a separate issue/PR to support this later.

@pm100
Contributor Author

pm100 commented Mar 9, 2024

I also messed up the fix for that missing-file test; hopefully I have it correct now.

@DDtKey
Member

DDtKey commented Mar 9, 2024

I apologize for the number of comments and suggestions; I understand that this may be annoying.
This is just a really great feature, but the implementation is the tricky part here: it should be extensible.

Many of the items can be done iteratively, in separate PRs. Thus, we just need to collect the features we're going to schedule for future enhancement. That would also allow other folks to contribute 🙂

Thanks for all the effort in addressing everything! 🙏

@pm100
Contributor Author

pm100 commented Mar 11, 2024

@DDtKey are you waiting for me to do something? I think I have done all that you asked; all checks are passing and coverage is 100%.

@DDtKey
Member

DDtKey commented Mar 11, 2024

Actually I'm not the only maintainer, so we need @dotenv-linter/core to review as well (only @mgrachev can merge / make the final decision here).

But from my side, I still think it would be nice for this part to be implemented 🤔
I also think it can be done at the deserialization level (with an internally tagged enum where the tag is type); this way we will have some basic validation of the schema structure even before applying any logic.

@pm100
Contributor Author

pm100 commented Mar 11, 2024

I don't think serde can do it when we are not deserializing the enums directly. I can make serde do it directly from a different JSON syntax that I don't like so much:

[
    {
        "key": "NAME",
        "type": { "String": null }
    }
]

or

[
    {
        "key": "NAME",
        "type": { "String": "regex" }
    }
]

which is very odd.

I can change the enum to

pub enum SchemaValueType {
    // #[default]
    String(Option<String>),
    Integer,
    Float,
    #[default]
    Boolean,
    Url,
    Email,
}

if you like (I don't think it adds anything).

I will have to remap it manually - which is fine, just like the Vec-to-HashMap translation.

I will then reject a regex on anything other than a string. Or I can simply do that without changing the enum.

@DDtKey
Member

DDtKey commented Mar 12, 2024

That's possible; did you consider using flatten?
Here you can find a working example on the Rust playground: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=a272538dd8572b664bc4494804cdd3a5

Working representation
#[derive(Deserialize, Debug)]
pub struct SchemaEntry {
    pub key: String,
    pub required: bool,
    #[serde(flatten)]
    pub ty: SchemaValueType,
}

#[derive(Deserialize, Debug)]
#[serde(tag = "type")]
pub enum SchemaValueType {
    String {
        regex: Option<String>
    },
    Integer,
    Float,
    Boolean,
    Url,
    Email,
}
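
For reference, with the internally tagged enum plus flatten, an entry in the current array format deserializes as-is:

{
    "key": "SECRET",
    "required": true,
    "type": "String",
    "regex": "^[a-z]*$"
}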

if you like ( I dont think it adds anything)

I believe this makes the code safer & stricter, since it is not possible to get type-specific fields from the branches for other types (e.g. regex should not be available when you process Integer, nor min/max under String).

Moreover, in the future we may end up with a monster type that contains many of these type-specific rules that don't belong to anything (only to the high-level struct). Thus, it may be difficult for contributors to understand when and what to use (confusing common rules with specific ones).

@mgrachev
Member

@pm100 Thank you for your contribution 👍

Sorry, but I need more time to review your PR, so I'll look at it around the weekend.

match schema::DotEnvSchema::load(schema_path) {
    Ok(schema) => Some(schema),
    Err(e) => {
        println!("Error loading schema: {}", e);
Member

Since we print the error ignoring the quiet mode flag, I think it should go to stderr rather than stdout 🤔
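
For instance, only the print macro would need to change (a minimal sketch):

// write to stderr instead of stdout
eprintln!("Error loading schema: {}", e);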

Contributor Author

It was stderr originally, but the CLI tests only look at stdout.

@mgrachev mgrachev linked an issue Mar 18, 2024 that may be closed by this pull request: schema driven checker
@pm100
Contributor Author

pm100 commented Mar 19, 2024

All done except CliOptions - see my comment

pub fn load(path: &Path) -> Result<Self, std::io::Error> {
    let file = File::open(path)?;
    let reader = BufReader::new(file);
    let read_schema: ReadDotEnvSchema = serde_json::from_reader(reader)?;
Member

Suggested change
let read_schema: ReadDotEnvSchema = serde_json::from_reader(reader)?;
let read_schema: DotEnvSchema = serde_json::from_reader(reader)?;

Why do you use the Read* structs if deserializing directly into the DotEnvSchema structure works?

Contributor Author

It does not work. It did originally, but once entries was changed to a HashMap and the regex was compiled into the schema at load time, it was no longer possible.

Member

@mgrachev mgrachev Mar 19, 2024

If we change the schema format to this one, it will work:

{
  "version": "1.0",
  "allow_other_keys": false,
  "entries": {
    "SECRET": {
      "required": true,
      "type": "String",
      "regex": "^[a-z]*$"
    },
    "PORT": {
      "type": "Integer"
    },
    "URL": {
      "type": "Url"
    }
  }
}

Working example:

#[derive(Deserialize, Default)]
pub struct DotEnvSchema {
    pub version: String,
    pub allow_other_keys: bool,
    pub entries: HashMap<String, SchemaEntry>,
}

#[derive(Deserialize, Default)]
pub struct SchemaEntry {
    #[serde(default)]
    pub required: bool,
    #[serde(rename = "type")]
    pub value_type: SchemaValueType,
    #[serde(with = "serde_regex", default)]
    pub regex: Option<Regex>,
}

#[derive(Deserialize, Default)]
pub enum SchemaValueType {
    #[default]
    String,
    Integer,
    Float,
    Boolean,
    Url,
    Email,
}

impl DotEnvSchema {
    pub fn load(path: &Path) -> Result<Self, std::io::Error> {
        let file = File::open(path)?;
        let reader = BufReader::new(file);
        let schema: DotEnvSchema = serde_json::from_reader(reader)?;

        Ok(schema)
    }
}

@DDtKey What do you think about the new schema format?

Member

@DDtKey DDtKey Mar 19, 2024

Yeah, I'm totally good with that. Actually I mentioned this way earlier:

Also alternative to consider: instead of array, try to use Object in JSON, where key is env variable name 🤔 This can be deserialized into map

Any valid Unicode string is a valid JSON key, so we should have no issues with that.

Member

@DDtKey DDtKey Mar 19, 2024

Another "unresolved" question regarding schema deserialization part is #761 (comment) (also there is an example in rust-playground)

I tend to keep type-specific rules attached to the type itself (like regex belongs to string). Just to keep our code more extensible & clear for the future

Member

Another "unresolved" question regarding schema deserialization part is #761 (comment) (also there is an example in rust-playground)

I tend to keep type-specific rules attached to the type itself (like regex belongs to string). Just to keep our code more extensible & clear for the future

I like your idea 👍

Comment on lines +91 to +94
for ch in &mut checks {
    let end_warns = ch.end();
    warnings.extend(end_warns);
}
Member

Suggested change
for ch in &mut checks {
    let end_warns = ch.end();
    warnings.extend(end_warns);
}
if disabled_checks.contains(&LintKind::SchemaViolation) {
    return warnings;
}
if let Some(schema) = schema {
    let keys = lines.iter().filter_map(|l| l.get_key()).collect::<Vec<&str>>();
    let required_warnings = schema_violation::check_required_keys(schema, &keys);
    warnings.extend(required_warnings);
}

It seems to me that it's more obvious what's going on.

Member

Highly related to #761 (comment)

Contributor Author

The end call was intended as a generic call for any checker that may want to do a final pass of some kind.

This change elevates the schema checker to a special case.

Comment on lines 124 to 138
fn end(&mut self) -> Vec<Warning> {
let mut warnings = Vec::new();
if let Some(schema) = self.schema {
for (key, entry) in &schema.entries {
if entry.required && !self.seen_keys.contains(key) {
warnings.push(Warning::new(
self.last_line_number,
self.name(),
format!("The {} key is required", entry.key),
));
}
}
}
warnings
}
Member

@mgrachev mgrachev Mar 19, 2024

Suggested change
fn end(&mut self) -> Vec<Warning> {
let mut warnings = Vec::new();
if let Some(schema) = self.schema {
for (key, entry) in &schema.entries {
if entry.required && !self.seen_keys.contains(key) {
warnings.push(Warning::new(
self.last_line_number,
self.name(),
format!("The {} key is required", entry.key),
));
}
}
}
warnings
}
pub(crate) fn check_required_keys(schema: &crate::schema::DotEnvSchema, keys: &[&str]) -> Vec<Warning> {
let mut warnings = Vec::new();
for (key, entry) in &schema.entries {
if entry.required && !keys.contains(&key.as_str()) {
warnings.push(Warning::new(
keys.len() + 1,
LintKind::SchemaViolation,
format!("The {} key is required", entry.key),
));
}
}
warnings
}

The end function is not too obvious, so I propose this solution.

Member

@DDtKey DDtKey Mar 19, 2024

Btw, I think there is a better alternative.

  • extract all required keys from the schema in SchemaViolationChecker::new and keep them as a HashSet
  • on every call of run we remove the seen key from the hashset
  • if at the end there are keys left in the hashset, we return warnings for them

This should be more efficient than going through the keys we collected one more time. Also, we won't be forced to keep all the keys from the file in memory.

This would still require the checker to have a callback, much like now, but instead of end I would suggest something like remaining_warnings or so. I believe this is even better, because instead of hardcoding we get extensible functionality for checkers; a sketch follows below.
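
A minimal sketch of that approach (types simplified; the real checker would emit Warning values, and remaining_warnings is only a suggested name):

use std::collections::HashSet;

struct SchemaChecker {
    // Populated from the schema's required entries in the constructor.
    required_keys: HashSet<String>,
}

impl SchemaChecker {
    // Called per line: a required key that appears is crossed off the set.
    fn run(&mut self, key: &str) {
        self.required_keys.remove(key);
    }

    // Called once at the end: anything still in the set was never seen.
    fn remaining_warnings(&self) -> Vec<String> {
        self.required_keys
            .iter()
            .map(|key| format!("The {} key is required", key))
            .collect()
    }
}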

Contributor Author

The end call is intended as a generic callback for all checkers that may want to do a final pass of some kind.

Removing the keys once seen will prevent validation of duplicated entries, i.e. the regex check... The second occurrence may be the 'intended' one.

Member

@DDtKey DDtKey Mar 19, 2024

Removing the keys once seen will prevent validation of duplicated entries, i.e. the regex check... The second occurrence may be the 'intended' one.

Didn't get it. What do you mean? Could you provide an example?

I expect to have something like required_keys: HashSet<String>, and we remove the keys from it once seen. It shouldn't be used for anything but required keys.

If a required key is present, we remove it from the set. The logic is almost the same as the current one with a second pass through all keys.

The end call is intended as a generic callback for all checkers that may want to do a final pass of some kind.

Regarding this one, as I said, a callback makes sense in general, especially with the suggested approach, but it would be nice to have better naming (see the comment above).

@pm100
Contributor Author

pm100 commented Mar 19, 2024

Reverted all Cli changes.

Using serde_regex.

@pm100
Contributor Author

pm100 commented Mar 20, 2024

Why not accept the PR, which is fully featured and passes with 100% test coverage - as far as I can tell, all the proposed changes are stylistic choices - and then change it however you like? There are tons of extra features that can be added once the base support is in:

  • use the schema for fix and compare
  • add numeric ranges
  • add min and max lengths
  • add allowed values
  • add support for dir and file types, both syntactically valid and optionally checked for existence
  • improve booleans (localizable???)

Or close it

@DDtKey
Member

DDtKey commented Mar 21, 2024

I think there is no point in merging the PR with an unfinished schema. This is the most important part of the PR. It's not a stylistic choice: the schema is a user-facing feature, and changes may lead to major updates which we can easily avoid at this point in time.

If we accept it unfinished, then we need someone to work on it to get the job done.

We are not asking for additional features, just changes to an important part. I don't think merging the intermediate state of a feature is a good idea.

Even if we did, we wouldn't be able to release it until we were ready to introduce the schema-format without making incompatible changes immediately afterwards.

But you have already done a huge amount of work (thank you!) and are familiar with how this can be changed.

@pm100
Contributor Author

pm100 commented Mar 21, 2024

I do not know what changes to the schema are being asked for; I only see implementation changes. Can somebody please post the proposed schema syntax and I will implement it.

@DDtKey
Member

DDtKey commented Mar 21, 2024

Sure, the suggestion was in the thread above #761 (comment)

@pm100
Contributor Author

pm100 commented Mar 21, 2024

The problem with that schema is that it is not describable in a JSON Schema itself, so JSON-aware editors won't like it.

The only reason for the proposed change is to slightly simplify the implementation of reading the schema, which isn't a compelling reason IMHO.

@DDtKey
Member

DDtKey commented Mar 24, 2024

The problem with that schema is that it is not describable in a JSON Schema itself, so JSON-aware editors won't like it.

Didn't really get your point. It's valid JSON, and any editor with JSON support works with it (at least I checked my editors).
I see no reason to be attached to JSON Schema, but it can also be supported with patternProperties if we wish.

The only reason for the proposed change is to slightly simplify the implementation of reading the schema, which isn't a compelling reason IMHO.

Well, I see at least a couple of reasons:

  • any JSON linter/validator will highlight duplicate keys, so we won't have to deal with this explicitly; it can be controlled at the deserialization level
  • it has a good representation in YAML (we may want to support YAML or TOML in the future; it's easy to do with serde), for example:
version: '1.0'
allow_other_keys: false
entries:
  SECRET:
    required: true
    type: String
    regex: "^[a-z]*$"
  PORT:
    type: Integer
  URL:
    type: Url

@pm100
Contributor Author

pm100 commented Mar 24, 2024

OK, I can change to that format. But it's much harder to detect duplicated keys - which I have to do, and forgot in the original code.

The problem is that serde happily accepts duplicates, i.e. this:

{
    "version": "1.0.0",
    "allow_other_keys": false,
    "entries": {
        "SECRET": {
            "required": true,
            "type": "String",
            "regex": "^[a-z]*$"
        },
        "PORT": {
            "type": "String"
        },
        "PORT": {
            "type": "Integer"
        }
    }
}

It means that I have to deserialize some other way; see https://stackoverflow.com/questions/64405600/how-can-i-deserialize-json-that-has-duplicate-keys-without-losing-any-of-the-val

And since the only real purpose of this change was to make the deserialization easier, I don't really like it.

BTW, it is possible to describe this format using JSON Schema; it just wasn't obvious.

@DDtKey
Member

DDtKey commented Mar 25, 2024

The problem is that serde happily accepts duplicates.

The most common solution is to use serde_with for this:
https://docs.rs/serde_with/latest/serde_with/rust/maps_duplicate_key_is_error/index.html

Just declare the field with the serde_as macro:

#[serde_as(as = "MapPreventDuplicates<_, _>")]

That's why I mentioned that we can get the duplicate check for free; a sketch follows below.
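
A sketch of how that could look on the schema struct (assuming the map-based format discussed above, with SchemaEntry as defined earlier in the thread):

use std::collections::HashMap;

use serde::Deserialize;
use serde_with::{serde_as, MapPreventDuplicates};

#[serde_as]
#[derive(Deserialize)]
pub struct DotEnvSchema {
    pub version: String,
    pub allow_other_keys: bool,
    // Deserialization fails with an error on duplicate keys instead of
    // silently keeping the last value.
    #[serde_as(as = "MapPreventDuplicates<_, _>")]
    pub entries: HashMap<String, SchemaEntry>,
}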

And since the only real purpose of this change was to make the deserialization easier, I don't really like it.

Please stop claiming that this is the only reason.
Above you can see at least two more reasons. Also, I suggested this before we even got to the implementation discussion.

@DDtKey
Member

DDtKey commented Mar 25, 2024

we get duplicate check for free in some editors, not at run time. So with this schema JSON, dotenv-linter will accept schemas with duplicated keys, which is not what you want (I assume)

Sorry, but did you read what I suggested? (Don't consider it rude; I just can't follow the conversation.)
There is a ready-made solution to disallow duplicate keys at the deserialization step with serde_with.

@pm100
Contributor Author

pm100 commented Mar 25, 2024

done

Member

@DDtKey DDtKey left a comment

LGTM.
A couple of iterations should be performed before releasing.

@mgrachev mgrachev merged commit 6b3db75 into dotenv-linter:master Mar 26, 2024
@mgrachev
Member

@pm100 Thank you for your help 👍

@mgrachev mgrachev mentioned this pull request Aug 6, 2025